9 - Recognising Groups among Dialects
- By Jelena Prokić, University of Groningen, John Nerbonne, University of Groningen
- Edited by John Nerbonne, University of Groningen, Charlotte Gooskens, University of Groningen, Sebastian Kürschner, Friedrich-Alexander-Universität Erlangen-Nürnberg, Renée van Bezooijen, University of Groningen
- Book:
- Computing and Language Variation
- Published by:
- Edinburgh University Press
- Published online:
- 12 September 2012
- Print publication:
- 04 December 2009, pp 153-172
Summary
In this paper we apply various clustering algorithms to dialect pronunciation data. We also propose several evaluation techniques that should be used to deal with the instability of clustering techniques. The results show that three hierarchical clustering algorithms are not suitable for the data we are working with. The remaining algorithms successfully detected a two-way split of the data into Eastern and Western dialects. At the aggregate level used in this research, no further division of sites can be asserted with high confidence.
INTRODUCTION
Dialectometry is a multidisciplinary field that uses various quantitative methods in the analysis of dialect data. Very often those techniques include classification algorithms, such as hierarchical clustering algorithms, used to detect groups within a certain dialect area. Although known for their instability (Jain and Dubes, 1988), clustering algorithms are often applied without evaluation (Goebl, 2007; Nerbonne and Siedle, 2005) or with only partial evaluation (Moisl and Jones, 2005). Very small differences in the input data can produce substantially different groupings of dialects (Nerbonne et al., 2008). Without proper evaluation, it is very hard to determine whether the results of the applied clustering technique are an artifact of the algorithm or reflect real groups in the data.
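The instability described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the five-site distance matrix is hypothetical (it is not the Bulgarian data), complete linkage is just one of many possible hierarchical algorithms, and the helper names `complete_linkage`, `add_noise` and `co_clustering_rate` are invented here. The sketch adds small random noise to the distances, re-clusters many times, and records how often each pair of sites ends up in the same group. It is not presented as the evaluation procedure used in this chapter.

```python
import random
from itertools import combinations


def complete_linkage(dist, k):
    """Merge clusters bottom-up until k remain; return a set of frozensets."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    while len(clusters) > k:
        # Pick the two clusters whose farthest members are closest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda p: max(dist[i][j] for i in p[0] for j in p[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


def add_noise(dist, scale, rng):
    """Return a symmetric copy of dist with uniform noise in [0, scale] added."""
    noisy = [row[:] for row in dist]
    for i, j in combinations(range(len(dist)), 2):
        noisy[i][j] = noisy[j][i] = dist[i][j] + rng.uniform(0, scale)
    return noisy


def co_clustering_rate(dist, k, scale, runs, seed=0):
    """Fraction of noisy runs in which each pair of sites shares a cluster."""
    rng = random.Random(seed)
    counts = {pair: 0 for pair in combinations(range(len(dist)), 2)}
    for _ in range(runs):
        for cluster in complete_linkage(add_noise(dist, scale, rng), k):
            for pair in combinations(sorted(cluster), 2):
                counts[pair] += 1
    return {pair: hits / runs for pair, hits in counts.items()}


# Hypothetical distances: sites 0-1 form one tight group, sites 3-4 another,
# and site 2 lies exactly halfway between the two groups.
toy = [
    [0, 1, 5, 10, 10],
    [1, 0, 5, 10, 10],
    [5, 5, 0, 5, 5],
    [10, 10, 5, 0, 1],
    [10, 10, 5, 1, 0],
]

rates = co_clustering_rate(toy, k=2, scale=0.5, runs=200)
for pair, rate in sorted(rates.items()):
    print(pair, round(rate, 2))
```

In this toy setting the two tight groups stay together in every run, while the borderline site switches sides depending on the noise. A single clustering run would assign it to one group with no indication that the assignment is arbitrary.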
The aim of this paper is to evaluate algorithms used to detect groups among dialect varieties measured at the aggregate level. The data used in this research is dialect pronunciation data consisting of various pronunciations of 156 words collected all over Bulgaria.